Data scientists have been called to the front lines to analyze data from the COVID-19 pandemic. It is not hyperbolic to say that data scientists have saved lives, but in this trying time they also entertain us. As we all cocoon ourselves in our homes, data scientists refine our Netflix recommendations, identify ISP outages in real time, and keep toilet paper traveling to the stores where people need it rather than to stores with hoarders. In that spirit, as data scientists, we chose to perform the following analysis for entertainment’s sake.
Students across the country have left their college campuses to embrace new online learning communities. This transition has not been easy and school spirit is probably not at an all-time high. One of the iconic representations of school spirit is the college fight song. The following analysis is a data exploration of college fight songs from the Power 5 schools (plus Notre Dame).
A dataset containing college fight songs was acquired from FiveThirtyEight.com. It includes variables such as the school, the author, the year the song was written, beats per minute, length, and lyric clichés. The original article by FiveThirtyEight.com allowed readers to select a school, view it on a graph comparing its length and speed with other colleges’ songs, and then see a list of the clichés in the lyrics. This was a great jumping-off point for our analysis. To the original dataset we added four more variables and merged it with a dataset containing university demographic data.
The first variables that were added were the 2019 football wins and losses for the schools in our fight song dataset. This data was obtained from ncaa.com. The next variables we obtained were from niche.com. Niche is a site that provides university information to applying students. The site also creates rankings and letter grades for schools on a wide variety of topics. We chose to utilize their party school rankings and athletic rankings. A school ranked number one is the best in that particular category.
Another valuable source of college information is the Integrated Postsecondary Education Data System (IPEDS). This data is provided by the National Center for Education Statistics. The IPEDS data can be explored via their website and customized datasets can be downloaded. The IPEDS dataset we utilized was one created and shared on Kaggle. The merging of this dataset with the fight song data proved to be a challenge.
When it came to cleaning the data, we mainly focused on the fight_songs and ipeds datasets. Both were relatively clean. The readxl package was especially helpful with ipeds because that data was spread across multiple sheets.
The universities in fight_songs are a subset of those in ipeds (fight_songs \(\subseteq\) ipeds). Our first approach was to sort both datasets alphabetically and then loop through the ipeds dataset once for each row of fight_songs, extracting the ipeds universities whose names were similar to the name in fight_songs. This was done with pmatch(). A match() call didn’t work well because match() requires exact matches and some of the university names in fight_songs were abbreviated.
Unfortunately, this did not work. Some of the university names in fight_songs were so generic that there were multiple matches in ipeds. Additionally, sorting the data alphabetically didn’t work for universities that included the word “The” in their title. This led to incorrect extractions, and we had to come up with a different solution for cleaning the data.
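The ambiguity problem described above can be illustrated with a small sketch. It is written in Python rather than the R used for the project, and the `partial_match` helper is a hypothetical stand-in that only roughly mimics the way pmatch() accepts a unique partial match and rejects ambiguous ones. The institution names are illustrative and are not rows taken from the ipeds file.

```python
def partial_match(name, candidates):
    """Return the index of the single candidate that equals or starts with
    `name`; return None when there is no match or more than one."""
    exact = [i for i, c in enumerate(candidates) if c == name]
    if len(exact) == 1:
        return exact[0]
    prefix = [i for i, c in enumerate(candidates) if c.startswith(name)]
    return prefix[0] if len(prefix) == 1 else None


# Illustrative institution names (assumed, not read from the real dataset).
ipeds_names = [
    "Auburn University",
    "The University of Alabama",
    "Washington State University",
    "Washington University in St Louis",
]

# Short names of the kind that appear in fight_songs.
results = {
    short: partial_match(short, ipeds_names)
    for short in ["Auburn", "Washington", "Alabama"]
}
print(results)
```

In this sketch "Auburn" resolves to one row, "Washington" is rejected because two names share the prefix, and "Alabama" finds nothing because the full name begins with "The".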
When it came to bringing the data together, we originally used pmatch() to look up universities. For universities that pmatch() could not match, we kept a separate CSV lookup table that contained university IDs from ipeds and university names from fight_songs.
We eventually dropped the pmatch() looping approach and the lookup table, and instead appended a new column containing the university IDs to the fight_songs CSV. This worked well for us because the university IDs used in the ipeds dataset are consistent with other national institutional data.
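A minimal sketch of the ID-based merge, again in Python rather than the project's R. The column names (`unit_id`, `school`, `pct_female`), the school names, and every value below are hypothetical placeholders, not the real fight_songs or ipeds contents.

```python
import csv
import io

# Hypothetical stand-ins for the two files; all names and values are made up.
fight_songs_csv = """school,unit_id,bpm
Alpha,1001,150
Beta State,1002,140
"""

ipeds_csv = """unit_id,name,pct_female
1001,Alpha University,52
1002,The Beta State University,48
1003,Gamma College,55
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Index ipeds by ID so each fight song needs a single dictionary lookup.
ipeds_by_id = {row["unit_id"]: row for row in read_rows(ipeds_csv)}

merged, unmatched = [], []
for song in read_rows(fight_songs_csv):
    match = ipeds_by_id.get(song["unit_id"])
    if match is None:
        unmatched.append(song["school"])   # flag rows with no ipeds record
    else:
        merged.append({**match, **song})   # combine columns from both sources

print(len(merged), "merged;", len(unmatched), "unmatched")
```

Because the join key is an identifier rather than a name, abbreviations and a leading "The" no longer affect the result, and any fight song without an ipeds record is reported instead of being silently mismatched.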
Throughout the process of acquiring the data, merging it into one dataset, and finding new sources we kept a list of questions that we wanted to explore. Some of the most interesting were:
After our data wrangling, we set off to find some answers.
The above interactive plot was inspired by FiveThirtyEight’s original analysis. The original plot graphed song length by speed. To build on that, we made the plot interactive so that the song titles and schools could be viewed. We also color-coded the data points by the athletic division of the school.
From this graph we see that the longest song is the Aggie War Hymn for Texas A&M. Auburn has one of the shortest songs and it is also on the slow end. Colorado and Oklahoma also have some of the shortest songs, but theirs are a bit faster. Most songs tend to be short and fast with a cluster of slower songs.
This data was used to create a new categorical variable in our dataset. We divided both the bpm and length into two even halves, creating four quadrants: short and fast, short and slow, long and fast, and long and slow.
Next, we wanted to visualize where and when each song was written, and whether a student wrote it.
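One way to build such a quadrant variable is sketched below in Python rather than the project's R. It assumes that "two even halves" means a split at the median, and it assumes ties go to the short or slow side; the field names and the four rows are hypothetical toy values.

```python
from statistics import median

# Toy rows with assumed field names; not values from the real dataset.
songs = [
    {"school": "A", "bpm": 70, "sec": 40},
    {"school": "B", "bpm": 150, "sec": 45},
    {"school": "C", "bpm": 80, "sec": 120},
    {"school": "D", "bpm": 160, "sec": 110},
]

# Assumption: each variable is cut at its median.
bpm_cut = median(s["bpm"] for s in songs)
sec_cut = median(s["sec"] for s in songs)


def quadrant(song):
    """Label a song by which side of each median it falls on."""
    length = "short" if song["sec"] <= sec_cut else "long"
    speed = "slow" if song["bpm"] <= bpm_cut else "fast"
    return f"{length} and {speed}"


labels = {s["school"]: quadrant(s) for s in songs}
print(labels)
```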
The above map displays the school locations based on the date their fight song was written. The red points represent schools where a student wrote the fight song and the blue points are schools where someone other than a student wrote the fight song. The trend in school fight songs begins with two schools prior to 1900, both with student-written songs. It then continues throughout the twentieth century. It does appear that there is a trend over time that moves away from student writers. Let’s check this by comparing the student versus non-student author rates in the first half of songs written with those in the second half.
As seen above, about two-thirds of the songs in the first half were written by students, compared with only one-third of the songs in the second half. The following graph then presents whether the song speed and length categories are correlated with student writers and whether this relationship changes through time.
Among songs written by non-students, the newest are the longer and slower ones; among songs written by students, the newest are the shorter and faster ones.
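The comparison itself is a simple calculation, sketched here in Python rather than the project's R. The years and flags are invented toy values chosen only to show the mechanics; they are not the dataset, and the shares they produce are not the report's results.

```python
# (year written, written by a student?) pairs; toy values only.
songs = [
    (1895, True), (1905, True), (1910, False), (1920, True),
    (1930, False), (1950, False), (1960, True), (1985, False),
]

ordered = sorted(songs)        # oldest song first
mid = len(ordered) // 2        # split point between the two halves


def student_share(rows):
    """Fraction of songs in `rows` that were written by a student."""
    return sum(1 for _, by_student in rows if by_student) / len(rows)


early_share = student_share(ordered[:mid])
late_share = student_share(ordered[mid:])
print(f"first half: {early_share:.2f}, second half: {late_share:.2f}")
```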
Moving on from the who and when of fight song writing, the following analysis looks into the song lyrics and, more specifically, the clichés in the lyrics.
Fight songs are full of clichés such as trash talking your opponent, cheering to fight, win, achieve victory, yelling your school colors, or even talking about how manly you are. The following collection of bar graphs breaks down each cliché and counts the number of schools that use it in their fight song.
Songs that mention their opponents or spell out words are not very common. Yet songs that mention winning, win, or victory are very common. It is also common to mention your school colors in your fight song.
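The counts behind bar graphs like these amount to tallying a yes/no flag per cliché. The sketch below is in Python rather than the project's R; the column names (`colors`, `opponents`, `spelling`) and the "Yes"/"No" coding are assumptions about how the data is stored, and the three rows are toy values.

```python
# Toy rows; column names and Yes/No coding are assumed, not confirmed.
songs = [
    {"school": "A", "colors": "Yes", "opponents": "No", "spelling": "No"},
    {"school": "B", "colors": "Yes", "opponents": "Yes", "spelling": "No"},
    {"school": "C", "colors": "No", "opponents": "No", "spelling": "Yes"},
]

cliches = ["colors", "opponents", "spelling"]

# Number of schools whose song uses each cliché.
counts = {c: sum(1 for s in songs if s[c] == "Yes") for c in cliches}
print(counts)
```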
Now let us look at whether these clichés are correlated with better football performance from last year.
Songs that reference school colors have a slightly higher median number of 2019 football wins but a smaller IQR. Songs that reference an opponent, on the other hand, have a lower median and a smaller IQR for 2019 football wins than songs that do not. Mentioning win or won is also correlated with more wins this season than mentioning victory.
While the above shows some correlations between certain lyrics and games won by the football teams in 2019, the small sample sizes mean there is probably not a statistically significant difference. Most importantly, it is important to remember that song lyrics do not influence football games. Perhaps, though, schools with long and successful athletic histories followed certain lyric trends.
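For readers who want to reproduce this kind of group summary, the median and IQR per group can be computed as sketched below, in Python rather than the project's R. The win totals are toy values, not the 2019 records. The sketch uses the "inclusive" quantile method, which to our understanding corresponds to the default in R's quantile(); treat that correspondence as an assumption.

```python
from statistics import median, quantiles

# Toy win totals by whether the song uses a given cliché; not real records.
wins = {
    "mentions": [5, 7, 8, 10, 11],
    "does not mention": [4, 6, 8, 9],
}


def median_and_iqr(values):
    """Return (median, interquartile range) for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


summary = {group: median_and_iqr(v) for group, v in wins.items()}
for group, (med, iqr) in summary.items():
    print(f"{group}: median={med}, IQR={iqr}")
```

With groups this small, the two summaries can differ by chance alone, which is the caution raised in the paragraph above.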
Next, we will look at two of these clichés in more depth: whether a school mentions men, boys, or sons and whether they spell out words.
Schools with fight songs that do not mention men, boys, or sons have a median student population that is 52% female, while those that do have a median of 48% female. In fact, there is even one school that mentions men that has only about 32% female students.
This is not to say that females are influenced by fight song lyrics when picking schools, but what does this say about schools that mention men in their lyrics? Were they perhaps all male at one point? Or are schools that mention men less likely to make other campus culture changes that help with female recruitment and retention? This is a great example of a surprising find that could represent complex social dynamics as confounding variables.
One might want to jump to conclusions about spelling out loud in a fight song and students’ SAT writing scores given the above graphs, but again it is important to remember that correlation does not equal causation and that a small difference does not mean a statistically significant difference. Even so, might there be some underlying roots for an actual difference here? Are those schools with fight songs that spell out words ones with higher acceptance rates? Did trends in fight song lyric writing diverge between academically minded schools and athletically minded schools?
To further explore campus culture and fight songs, next we will look at niche.com athletic and party school rankings with our fight songs.
The above interactive map shows the type of song and their athletic and party school rankings. If you like short and fast songs and prioritize athletics and partying, then Alabama is your best bet. If you want to avoid parties but like athletics and want a longer and slower fight song, then Baylor is the school for you. Overall, the Power 5 schools perform well athletically and tend to be party schools, but the spread is wider among party rankings than athletic rankings. The above graph might help marching band students when picking a college.
The above graph is beneficial in a few ways. It shows us that having a high (meaning closer to one) athletic rank but a low party rank is not very common. We can also see that schools with higher party rankings tend to come from the SEC or Big Ten. Perhaps most notably, schools with low party ranks do not have songs written by students.
To wrap up, we believe there are some solid conclusions we can draw:
The above analysis was conducted for Data Science 202 at Iowa State University as a Spring final project by Jessie Bustin, Ann Gould, Matthew Coulibaly, and Henry Underhill. The team worked through GitHub, conducted the analysis in RStudio, and presented the findings in this report as well as a PowerPoint presentation.
Ann Gould coordinated communication with the professor and TA. She also completed the data cleaning and merged datasets.
Jessie Bustin found the dataset and led the team analysis and composition of this report. Her analysis focused on creating interactive graphs and the map.
Matthew Coulibaly produced graphs and analysis of the niche.com rankings and their relationships with the song data.
Henry Underhill provided analysis of the song lyrics and their relationships to 2019 football wins and other college demographics.
While the above breakdown shows each member’s primary focus, work flowed regularly between team members across these areas. The team worked in unison to produce questions to lead the analysis and to search for data to add to the original fight song dataset. The team also worked together in Google Slides to create a presentation before importing it into PowerPoint, where finishing touches were added.
To facilitate this process, the team kept in communication through a group text and also met via WebEx to make decisions and divide workloads.